A major dilemma for a salesman is guessing the price range a customer has in mind for a particular product, since directly asking for a customer's budget is sometimes considered rude. Hence, Analytics Educator is building a predictive model to estimate the total amount a customer is willing to pay. In this case we have taken data on car buyers, and our predictive algorithm will help us understand the price range at which the customer is looking to buy the car. The dataset contains the following variables: Customer Name, Customer e-mail, Country, Gender, Age, Annual Salary, Credit Card Debt, Net Worth, and Car Purchase Amount.
The model should predict the Car Purchase Amount.
We will be using two of the most important and robust techniques of the modern data science industry to build the model: Artificial Neural Networks and Extreme Gradient Boosting. Once done, we will compare the results to see which algorithm gives better accuracy.
Artificial Neural Networks (ANNs) are a family of machine learning methods designed to mimic the structure and operation of the human brain. They are built to identify intricate patterns in data and generate predictions based on that analysis. Due to their performance in a variety of applications, including image recognition, natural language processing, and speech recognition, as well as their capacity to process massive volumes of data quickly and accurately, ANNs have become quite popular.
ANNs are made up of layers of interconnected nodes, also referred to as artificial neurons. These neurons take in information, process it mathematically, and then pass their results on to the neurons in the next layer. ANNs adjust their weights and biases through a technique known as backpropagation to improve their performance on a specific task. The result is a strong and adaptable machine learning system that can be trained to tackle many challenging problems.
In recent years, advances in processing power and data availability have significantly improved the performance of ANNs and broadened their range of potential applications. As a result, ANNs are now an essential tool for scientists, engineers, and researchers working in a variety of fields.
Artificial Neural Networks are modelled after the structure and operation of the human brain and consist of interconnected nodes that process inputs and produce outputs. To perform better on a particular task, ANNs use a learning procedure that adjusts the weights, i.e. the strengths of the connections between nodes.
An artificial neuron, which receives input from other neurons and generates an output, is the fundamental building block of an ANN. Each input is multiplied by a weight, and the weighted inputs are summed. A bias term is then added to this sum, and the result is passed through an activation function to determine the neuron's output. This output is then fed to the neurons in the next layer.
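As a rough, self-contained illustration (not part of this post's model), the forward pass of one artificial neuron can be sketched in a few lines of NumPy; the inputs, weights, and bias below are made-up numbers:

import numpy as np

def relu(z):
    # ReLU activation: keeps positive values, clips negatives to 0
    return np.maximum(0, z)

# Made-up inputs and parameters for a single artificial neuron
inputs  = np.array([0.5, -1.2, 3.0])   # signals coming from the previous layer
weights = np.array([0.8,  0.1, 0.4])   # strength of each connection
bias    = 0.2                          # additive shift of the weighted sum

# Weighted sum plus bias, passed through the activation function
output = relu(np.dot(inputs, weights) + bias)
print(output)  # this value is passed on to the neurons in the next layer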
ANNs are organized into layers. The input layer receives the data and passes it to one or more hidden layers, and the output layer produces the network's final result. During training, the network is shown examples of input data together with their matching outputs, and its weights and biases are adjusted based on the discrepancy between the predicted output and the actual output. Backpropagation is the technique used to compute these adjustments and improve the network's performance on the task.
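To make the training idea concrete, here is a minimal, illustrative gradient-descent step for a single linear neuron with a squared-error loss; this is a drastic simplification of backpropagation (which applies the chain rule across many layers), and all numbers are made up:

import numpy as np

# One training example (made-up values)
x = np.array([1.0, 2.0])    # inputs
y_true = 3.0                # actual (target) output

# Current parameters and learning rate
w = np.array([0.5, -0.2])   # weights
b = 0.1                     # bias
lr = 0.01                   # learning rate

# Forward pass: prediction and squared-error loss
y_pred = np.dot(w, x) + b
loss = (y_pred - y_true) ** 2

# Backward pass: gradients of the loss with respect to the parameters
grad_w = 2 * (y_pred - y_true) * x
grad_b = 2 * (y_pred - y_true)

# Update: nudge the parameters in the direction that reduces the loss
w -= lr * grad_w
b -= lr * grad_b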
Once trained, the ANN can be applied to new data to produce predictions. The input data is passed through the network, and the weights and biases learned during training determine the output. ANNs have been applied successfully to numerous tasks, including audio and image recognition, natural language processing, and financial forecasting.
Extreme Gradient Boosting, or XGBoost, is a powerful and popular open-source machine learning framework used for supervised learning problems such as regression and classification. It is a distributed gradient boosting library that has been optimised to be highly efficient, flexible, and portable.
XGBoost builds an ensemble of weak decision trees, each of which is trained to correct the mistakes made by the trees before it. During training, XGBoost iteratively adds new decision trees to the ensemble in order to optimise a loss function.
The algorithm computes the gradient of the loss function with respect to the current predictions and uses it to determine the best split points for each tree. With this approach, XGBoost can handle enormous datasets and deliver state-of-the-art performance on a range of workloads.
XGBoost lets users tune a wide range of hyperparameters, including the learning rate, the maximum depth of each tree, and the number of trees in the ensemble. It also supports several forms of regularisation to reduce overfitting and improve generalisation. In addition, XGBoost offers helpful features such as built-in cross-validation and early stopping to aid hyperparameter tuning and avoid overfitting.
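As a hedged illustration of these knobs (all parameter values are arbitrary and the data is synthetic, generated only for this sketch), XGBoost's built-in cross-validation with early stopping might be used like this:

import numpy as np
import xgboost as xg

# Synthetic regression data purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(size=200)
dtrain = xg.DMatrix(X_demo, label=y_demo)

params = {
    "objective": "reg:squarederror",   # squared-error regression
    "learning_rate": 0.1,              # shrinks each tree's contribution
    "max_depth": 4,                    # limits how deep each tree can grow
    "reg_alpha": 0.1,                  # L1 regularisation on leaf weights
    "reg_lambda": 1.0,                 # L2 regularisation on leaf weights
}

# 5-fold cross-validation; stops adding trees once the test RMSE
# has not improved for 20 consecutive boosting rounds
cv_results = xg.cv(params, dtrain, num_boost_round=500, nfold=5,
                   early_stopping_rounds=20, seed=123)
print(cv_results.tail())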
Numerous applications, such as online advertising, fraud detection, and natural language processing, have effectively exploited XGBoost. Its effectiveness, scalability, and versatility have led to its widespread adoption in both academia and industry.
Extreme Gradient Boosting, or XGBoost, is a supervised machine learning algorithm that boosts model accuracy using gradients. The algorithm builds an ensemble of weak decision trees, each of which is trained to correct the mistakes made by the one before it. The fundamental idea behind XGBoost is to reduce a loss function by expanding the ensemble with new decision trees that fit the residuals of the earlier ones. By repeatedly adding fresh trees to the ensemble during training, XGBoost drives the loss function down. Each tree is trained on the data (optionally a subsample of it); the algorithm computes the gradient of the loss function with respect to the current predictions and uses it to choose the optimal split points for the tree.
The algorithm starts by building a single decision tree, a straightforward model that predicts the target variable from a set of input features. The next decision tree is then trained on the residuals (the differences between the predicted and actual values) of the first model. This process of fitting a decision tree and using its residuals to train the following tree is repeated until the required number of trees is reached, as sketched below.
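This residual-fitting loop can be sketched with plain scikit-learn decision trees; the snippet below is a simplified gradient-boosting skeleton for squared-error loss, not XGBoost's actual implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    # Start from a constant prediction: the mean of the target
    prediction = np.full(len(y), float(np.mean(y)))
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                     # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # the next tree learns the residuals
        prediction += learning_rate * tree.predict(X)  # shrink and add its correction
        trees.append(tree)
    return trees, prediction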
To avoid overfitting, XGBoost supports a number of regularisation techniques, including L1 and L2 regularisation, which penalise large weights and restrict the complexity of each tree. Its built-in cross-validation and early stopping, mentioned earlier, further help with hyperparameter tuning and guard against overfitting.
Once the ensemble of trees has been trained, XGBoost combines the predictions from each tree to produce the final output. For classification problems, XGBoost transforms the predicted scores into class probabilities (using a softmax in the multi-class case). For regression problems, the final prediction is the sum of the contributions of all the trees, as illustrated below.
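For intuition only (the numbers below are made up), this is how per-tree outputs combine into one regression prediction: a sum added to the starting score rather than an average.

import numpy as np

# Hypothetical contributions of four trees for one sample
# (already shrunken by the learning rate during training)
tree_outputs = np.array([12000.0, 3100.0, -750.0, 420.0])
base_score = 0.5   # the default starting prediction

# Regression: the final output is the starting score plus the SUM
# of all tree contributions, not their average
prediction = base_score + tree_outputs.sum()
print(prediction)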
Overall, XGBoost is a strong and adaptable machine learning algorithm that excels at a wide range of tasks and can handle enormous datasets. Because of its effectiveness, scalability, and versatility, it is frequently utilised in both academia and industry.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.chdir("C:\\Users\\ASUS\\Desktop")
car_df = pd.read_csv('Car_Purchasing_Data.csv', encoding='ISO-8859-1')
car_df.head()
|   | Customer Name | Customer e-mail | Country | Gender | Age | Annual Salary | Credit Card Debt | Net Worth | Car Purchase Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Martina Avila | cubilia.Curae.Phasellus@quisaccumsanconvallis.edu | Bulgaria | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 | 35321.45877 |
| 1 | Harlan Barnes | eu.dolor@diam.co.uk | Belize | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 | 45115.52566 |
| 2 | Naomi Rodriquez | vulputate.mauris.sagittis@ametconsectetueradip... | Algeria | 1 | 43.152897 | 53798.55112 | 11160.355060 | 638467.1773 | 42925.70921 |
| 3 | Jade Cunningham | malesuada@dignissim.com | Cook Islands | 1 | 58.271369 | 79370.03798 | 14426.164850 | 548599.0524 | 67422.36313 |
| 4 | Cedric Leach | felis.ullamcorper.viverra@egetmollislectus.net | Brazil | 1 | 57.313749 | 59729.15130 | 5358.712177 | 560304.0671 | 55915.46248 |
Here "Car Purchase Amount" is our dependent variable; we need to predict it based on other independent variables like Age, Annual Salary, Credit Card Debt etc.
car_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Customer Name        500 non-null    object
 1   Customer e-mail      500 non-null    object
 2   Country              500 non-null    object
 3   Gender               500 non-null    int64
 4   Age                  500 non-null    float64
 5   Annual Salary        500 non-null    float64
 6   Credit Card Debt     500 non-null    float64
 7   Net Worth            500 non-null    float64
 8   Car Purchase Amount  500 non-null    float64
dtypes: float64(5), int64(1), object(3)
memory usage: 35.3+ KB
We can see that the data has a total of 500 rows and 9 columns. Customer Name, Customer e-mail and Country are character (object) variables. All character variables should be checked, since they may need to be converted into dummy variables.
n = car_df.nunique(axis=0)
n
Customer Name          498
Customer e-mail        500
Country                211
Gender                   2
Age                    500
Annual Salary          500
Credit Card Debt       500
Net Worth              500
Car Purchase Amount    500
dtype: int64
We can see that the object (character) variables - Customer Name, Customer e-mail and Country - have 498, 500, and 211 unique values respectively. This means that almost all the values are unique and common values are rare. Hence, these variables are of no use to us: even if we created dummy variables from them, they would hardly have any impact on the dependent variable. We will drop them.
car_df = car_df.drop(['Customer Name', 'Customer e-mail', 'Country'], axis = 1)
car_df.head(2)
|   | Gender | Age | Annual Salary | Credit Card Debt | Net Worth | Car Purchase Amount |
|---|---|---|---|---|---|---|
| 0 | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 | 35321.45877 |
| 1 | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 | 45115.52566 |
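As an aside, had a categorical column such as Country contained only a handful of distinct values, it could have been one-hot encoded instead of dropped. A hedged sketch on made-up data:

import pandas as pd

# Made-up example of a low-cardinality categorical column
demo = pd.DataFrame({"Country": ["India", "USA", "India", "UK"]})

# pd.get_dummies creates one 0/1 indicator column per category
dummies = pd.get_dummies(demo["Country"], prefix="Country")
print(dummies)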
# Suppress FutureWarnings before plotting the correlation heatmap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Correlation matrix of the numeric variables
corr = car_df.corr()
# With thresh = 0 every off-diagonal correlation is kept; raise the
# threshold to display only the stronger correlations
thresh = 0
kot = corr[((corr >= thresh) | (corr <= -thresh)) & (corr != 1)]
plt.figure(figsize=(10,3))
sns.heatmap(kot, cmap="Reds", annot=True)
[Correlation heatmap of the numeric variables, annotated with the correlation values]
# Remove the label column from the training features
X = car_df.drop(['Car Purchase Amount'], axis=1)
# Assign the label values to y
y = car_df['Car Purchase Amount']
# Split it to a 70:30 Ratio Train:Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=42)
# Importing the Keras libraries and packages
import tensorflow as tf
### Initializing the ANN
ann = tf.keras.models.Sequential()
### Adding the input layer and the first hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
### Adding the second hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
### Adding the output layer (only 1 output, hence a single unit)
ann.add(tf.keras.layers.Dense(units=1))
### Compiling the ANN
ann.compile(optimizer = 'adam', loss = 'mean_squared_error')
### Training the ANN model on the Training set
# Convert the pandas objects to NumPy arrays before fitting
y_train = np.array(y_train)
X_train = np.array(X_train)
ann.fit(X_train, y_train, batch_size = 25, epochs = 100)
Epoch 1/100
14/14 [==============================] - 0s 846us/step - loss: 2254526208.0000
Epoch 2/100
14/14 [==============================] - 0s 769us/step - loss: 874682560.0000
Epoch 3/100
14/14 [==============================] - 0s 1ms/step - loss: 336855360.0000
...
Epoch 99/100
14/14 [==============================] - 0s 769us/step - loss: 46097556.0000
Epoch 100/100
14/14 [==============================] - 0s 692us/step - loss: 46090004.0000
# Generate predictions on the unseen test data
y_pred = ann.predict(X_test)
y_test = y_test.tolist()
d = pd.DataFrame()
d["y_test"] = y_test
d["y_pred"] = y_pred.flatten()  # flatten the (n, 1) Keras output to 1-D
# MAPE (Mean Absolute Percentage Error)
d["mp"] = (abs(d["y_test"] - d["y_pred"])) / d["y_test"]
(d.mp.mean())*100
13.022884808054105
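As a cross-check, the same metric can also be computed with scikit-learn's built-in helper (available from scikit-learn 0.24 onwards); it should agree with the manual calculation above:

from sklearn.metrics import mean_absolute_percentage_error

# Returns a fraction; multiply by 100 to express it as a percentage
mape = mean_absolute_percentage_error(d["y_test"], d["y_pred"]) * 100
print(mape)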
# Importing the XGB libraries and packages
import xgboost as xg
# Instantiation (note: 'reg:linear' is deprecated in newer XGBoost
# releases in favour of 'reg:squarederror', as the warning below shows)
xgb_r = xg.XGBRegressor(objective ='reg:linear', n_estimators = 100, seed = 123)
# Fitting the model
xgb_r.fit(X_train, y_train)
[09:36:09] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=8, num_parallel_tree=1,
             objective='reg:linear', predictor='auto', random_state=123,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
             subsample=1, tree_method='exact', validate_parameters=1,
             verbosity=None)
y_pred = xgb_r.predict(X_test)
d = pd.DataFrame()
d["y_test"] = y_test
d["y_pred"] = y_pred
# MAPE
d["mp"] = abs((d["y_test"]- d["y_pred"])/d["y_test"])
(d.mp.mean())*100
3.632458679991241
Comparing the two models on the test set, XGBoost achieves a MAPE of about 3.6%, far better than the ANN's roughly 13%, so XGBoost gives the better accuracy here. Readers of this blog are welcome to mail us their suggestions on how to further improve the model; you will find our contact details here.
To know about all our courses please click here
If you want to read more such case studies then click on Whom should you ask for donations for a charity or Identify if a patient has cancer
Regression problems can be found at House Price Prediction and Insurance Premium Prediction
How to use Machine Learning in Real Estate companies, How to predict the price of 2nd hand cars
Learn Pandas Group by function or How to get a job in Data Science